library(tidyverse)  # provides read_csv() plus the dplyr/tidyr functions used below
data_URL <- "https://tinyurl.com/36h67mhm"
yeast <- read_csv(data_URL)
head(yeast)
2023-10-01
Factors in R
Factors typically represent categorical variables in R. They are a unique data type that, under the hood, maps character strings to integers.
Factors can be ordered or unordered.
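To make the string-to-integer mapping concrete, here is a minimal base-R sketch. The vectors `shades` and `doses` are made-up examples, not part of the yeast data.

```r
# A small unordered factor built from made-up values
shades <- factor(c("red", "blue", "red", "green"))

# The distinct categories; by default they are sorted alphabetically
levels(shades)

# The integer codes stored under the hood; each code indexes into levels()
codes <- as.integer(shades)
codes

# Indexing the levels by the codes recovers the original strings
levels(shades)[codes]

# An ordered factor additionally supports comparisons between levels
doses <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
at_least_medium <- doses >= "medium"
at_least_medium
```

Comparison operators such as `>=` are only meaningful for ordered factors; applying them to an unordered factor returns `NA` with a warning.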
Creating factors from scratch
The code below illustrates the process of creating factor vectors by hand:
# unordered factor
flavors <- factor(c("sweet","sour", "sweet", "salty"))
# ordered factor
sizes <- factor(c("small", "small", "tiny", "large", "medium"),
                levels = c("tiny", "small", "medium", "large"),
                ordered = TRUE)
flavors
[1] sweet sour sweet salty
Levels: salty sour sweet
sizes
[1] small small tiny large medium
Levels: tiny < small < medium < large
Factors from continuous data
The cut function is useful for turning numerical data into factors. The key arguments are the breaks specifying the intervals for binning the data and the labels indicating the factor categories you want to create.
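As a minimal sketch of how `breaks` and `labels` work together, the example below bins a made-up numeric vector (`heights_cm` is hypothetical, not from the yeast data). Note that `cut` needs one fewer label than break points, and that by default each interval is closed on the right.

```r
# Hypothetical numeric measurements
heights_cm <- c(150, 162, 171, 158, 185, 199)

# Four break points define three bins: (0,160], (160,180], (180,Inf]
height_class <- cut(heights_cm,
                    breaks = c(0, 160, 180, Inf),
                    labels = c("short", "average", "tall"),
                    ordered_result = TRUE)  # return an ordered factor

height_class

# Count how many observations fall in each bin
table(height_class)
```

Values that fall outside the outermost breaks become `NA`, which is why the sketch uses `0` and `Inf` as the outer limits.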
★ Factoring example: Your turn
A plot of the Flo11.expr (FLO11 expression data) hints at two or three modes in the distribution, as illustrated below:
Complete the following code to create an ordered factor with the categories “Low”, “Intermediate”, and “High”, indicating a coarse categorization of FLO11 expression as illustrated in the figure above:
If correct, your code should produce the following output:
Pivoting
Pivoting is the act of reshaping our data to make it “longer” or “wider”:
longer = fewer columns, more rows
wider = more columns, fewer rows
Pivot longer example
tidyr::pivot_longer is the core function for long pivoting.
First, let’s create a small example data frame to make it easier to see what pivoting is doing:
Now we pivot longer on the replicate “CM” (colony morphology) columns. Note that the information in the column headers becomes entries in the “Replicate” column:
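A minimal sketch of this step, assuming a tiny made-up data frame: the strain names and scores in `small_yeast` are hypothetical stand-ins for a few rows of the real data, and the output column name `Value` is likewise an assumption.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for a few rows of the yeast data
small_yeast <- tribble(
  ~Strain, ~CM.a, ~CM.b, ~CM.c,
  "S1",    1,     2,     1,
  "S2",    3,     3,     4
)

# Stack the three CM columns: the old column names go into "Replicate",
# the cell values go into "Value"
small_long <- small_yeast |>
  pivot_longer(cols = starts_with("CM."),
               names_to = "Replicate",
               values_to = "Value")

small_long
```

Two rows and three replicate columns become six rows, one per strain and replicate combination.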
Pivot longer example, cont.
Note that after pivoting, our new “Replicate” column actually contains two pieces of information. The “CM” prefix indicates what the phenotype was, while the letters “a”, “b”, and “c” indicate the replicate. We can improve on the prior code as follows:
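One possible way to separate the two pieces of information, sketched on a made-up data frame (`small_yeast` and its values are hypothetical): passing two names to `names_to` together with `names_sep` splits each header at the period, the same approach used in the long-and-wide example later in these notes.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for a few rows of the yeast data
small_yeast <- tribble(
  ~Strain, ~CM.a, ~CM.b, ~CM.c,
  "S1",    1,     2,     1,
  "S2",    3,     3,     4
)

# Split each header such as "CM.a" into a phenotype ("CM") and a replicate ("a")
small_long_split <- small_yeast |>
  pivot_longer(cols = starts_with("CM."),
               names_to = c("Phenotype", "Replicate"),
               names_sep = "\\.",   # regular expression matching a literal period
               values_to = "Value")

small_long_split
```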
★ Pivot longer: Your turn
Modify the code below to generate the figure that follows, showing how the distribution of the adhesiveness phenotype (Adhes.) differs across replicates:
Question: Why might such a plot be useful when considering replicate data? What does the above figure imply about the adhesiveness data?
Pivoting long and wide
Sometimes it’s useful to combine long and wide pivoting. This is illustrated in the example below, where we compare the relationship between the colony morphology scores (CM) and adhesiveness (Adhes) across replicates.
# subset data to focal columns for illustration
cm_and_adhes <-
  yeast |>
  select(Strain, CM.a:CM.c, Adhes.a:Adhes.c)
head(cm_and_adhes)
# pivot longer
long_yeast <-
  cm_and_adhes |>
  pivot_longer(starts_with(c("CM.", "Adhes.")),
               names_sep = "\\.", # split on the period in column names
               names_to = c("Phenotype", "Replicate"),
               values_to = "Value")
head(long_yeast)
# pivot wider
long_then_wide_yeast <-
  long_yeast |>
  pivot_wider(names_from = "Phenotype", values_from = "Value")
head(long_then_wide_yeast)
As can be seen above, in this case we lengthened then widened, but we generated a different data structure than we started with. This new data frame allows us to generate figures like the following:
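A minimal sketch of such a figure, assuming a scatter plot of CM against Adhes with one panel per replicate. The data frame `toy` is a made-up stand-in with two strains and two replicates, and the choice of `geom_point` with `facet_wrap` is an assumption about the plot, not the original figure.

```r
library(tidyr)
library(tibble)
library(ggplot2)

# Hypothetical stand-in for the focal columns of the yeast data
toy <- tribble(
  ~Strain, ~CM.a, ~CM.b, ~Adhes.a, ~Adhes.b,
  "S1",    1,     2,     0.2,      0.3,
  "S2",    3,     4,     0.7,      0.9
)

# Lengthen, then widen: one row per strain and replicate,
# with CM and Adhes as separate columns
toy_by_replicate <- toy |>
  pivot_longer(cols = -Strain,
               names_sep = "\\.",
               names_to = c("Phenotype", "Replicate"),
               values_to = "Value") |>
  pivot_wider(names_from = "Phenotype", values_from = "Value")

# Scatter plot of the two phenotypes, one panel per replicate
p <- ggplot(toy_by_replicate, aes(x = CM, y = Adhes)) +
  geom_point() +
  facet_wrap(~Replicate)
p
```

Because each row now holds both phenotypes for a single strain and replicate, the two measurements can be mapped directly to the x and y aesthetics.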